Abstract: Identification of defective members of large populations
has been widely studied in the statistics community under
the name of group testing. It involves grouping subsets of items 
into different pools and detecting defective members based on 
the set of test results obtained for each pool. 

In a classical noiseless group testing setup, it is assumed that 
the sampling procedure is fully known to the reconstruction 
algorithm, in the sense that the existence of a defective member
in a pool makes the test outcome of that pool positive.
However, this is not always a valid assumption in some
cases of interest. In particular, we consider the case where the
defective items in a pool can become independently inactive with 
a certain probability. Hence, one may obtain a negative test result 
in a pool despite containing some defective items. As a result, 
any sampling and reconstruction method should be able to cope 
with two different types of uncertainty, i.e., the unknown set of 
defective items and the partially unknown, probabilistic testing 
procedure. 

In this work, motivated by the application of detecting infected 
people in viral epidemics, we design non-adaptive sampling 
procedures that allow successful identification of the defective 
items through a set of probabilistic tests. Our design requires 
only a small number of tests to single out the defective items. 
In particular, for a population of size $N$ and at most $K$
defective items with activation probability $p$, our results show
that $M = O(K^2 \log(N/K)/p^3)$ tests are sufficient if the sampling
procedure should work for all possible sets of defective items,
while $M = O(K \log(N)/p^3)$ tests are enough to be successful
for any single set of defective items. Moreover, we show that the 
defective members can be recovered using a simple reconstruction 
algorithm with complexity of O(MN). 

Index Terms — Group testing, probabilistic tests, sparsity re- 
covery, compressed sensing, epidemiology. 



I. Introduction 

INVERSE problems, with the goal of recovering a signal 
from partial and noisy observations, come in many different 
formulations and arise in many applications. One important 
property of an inverse problem is to be well-posed, i.e., there 
should exist a unique and stable solution to the problem [2]. In
this regard, prior information about the solution, like sparsity,
can be used as a "regularizer" to transform an ill-posed
problem to a well-posed one. In this work, we look at a
particular inverse problem with fewer measurements than the
number of unknowns (ill-posed) but with sparsity constraints
on the solution. As will be explained in detail, the interesting 
aspect of this problem is that the sampling procedure is 
probabilistic and not fully known at recovery time. 

Suppose that in a large set of items of size N, at most 
$K \ll N$ of them are defective and we wish to identify
this small set of defective items. By testing each member 
of the set separately, we can expect the cost of the testing 
procedure to be large. If we could instead pool a number 
of items together and test the pool collectively, the number 
of tests required might be reduced. The ultimate goal is to 
construct a pooling design to identify the defective items while 
minimizing the number of tests. This is the main conceptual 
idea behind the classical group testing problem which was 
introduced by Dorfman [3] and later found applications in a
variety of areas. The first important application of the idea 
dates back to World War II when it was suggested for syphilis 
screening. A few other examples of group testing applications 
include testing for defective items (e.g., defective light bulbs 
or resistors) as a part of industrial quality assurance [4],
DNA sequencing [5], DNA library screening in molecular
biology (see, e.g., [6]-[10] and the references therein), multi-access
communication [11], data compression [12], pattern
matching [13], streaming algorithms [14], software testing [15]
and compressed sensing [16]. See the books by Du and Hwang
for a detailed account of the major developments in this
area [17], [18].

In a classical group testing setup, it is assumed that the 
reconstruction algorithm has full access to the sampling pro- 
cedure, i.e., it knows which items participate in each pool. 
Moreover, if the tests are reliable, the existence of a defective 
item in a pool results in a positive test outcome. In an 
unreliable setting, the test results can be contaminated by 
false positives and/or false negatives. Compared to the reliable 
setting, special care should be taken to tackle this uncertainty 
in order to successfully identify the defective items. However, 
in some cases of interest, there can exist other types of 
uncertainty that challenge the recovery procedure. 

In this work, we investigate the group testing problem 
with probabilistic tests. In this setting, a defective item that
participates in a pool can be inactive, i.e., the test result of a
pool can be negative despite containing some defective items. 

Fig. 1. Each element of the sampling matrix is generated independently from the corresponding element of the contact matrix by passing through a channel. The zeros in the contact matrix remain zeros in the sampling matrix while the ones are converted to zeros with probability $1 - p$.

Therefore, a negative test result does not indicate that all
the items in the corresponding pool are non-defective with
certainty. We follow a probabilistic approach to model the 
activity of the defective items, i.e., each defective item is active 
independently in each pool with probability p. Therefore, the 
tests contain uncertainty not in the sense of false positives or 
false negatives, but in the sense of the underlying probabilistic 
testing procedure. More precisely, let us denote by $M^{(c)}$ the
designed contact matrix which indicates the items involved in
each pool, i.e.,

$$M^{(c)}_{ij} = \begin{cases} 1 & \text{if test } i \text{ includes item } j, \\ 0 & \text{otherwise.} \end{cases}$$

The probabilistic tests are then given by

$$y = M^{(s)} \cdot x,$$

where $M^{(s)}$ denotes the probabilistic sampling matrix, $x$ is the
sparse input vector and $y$ denotes the vector of test results.
Each element of the contact matrix $M^{(c)}$ is independently
mapped to the corresponding element of the sampling matrix
$M^{(s)}$ by passing through the channel shown in Figure 1 [19].
In fact, the zeros in the contact matrix remain zeros in the
sampling matrix while the ones are mapped to zeros with
probability $1 - p$.

In this work, our goal is to design efficient sampling 
and recovery mechanisms to successfully identify the sparse 
vector x, despite the partially unknown testing procedure 
given by the sampling matrix $M^{(s)}$. Our interest is in non-adaptive
sampling procedures in which the sampling strategy
(i.e., the contact matrix) is designed before seeing the test 
outcomes. In our analysis, we consider two different design 
strategies: In the per-instance design, the sampling procedure 
should be suitable for a fixed set of defective items with 
overwhelming probability while in the universal design, it 
should be appropriate for all possible sets of defective items. 
We show that $M = O(K \log(N)/p^3)$ tests are sufficient
for successful recovery in the per-instance scenario while
we need $M = O(K^2 \log(N/K)/p^3)$ tests for the universal
design. Moreover, the defective items can be recovered by a
simple recovery algorithm with complexity of $O(MN)$. For
a constant parameter $p$, the bounds on the number of measurements
are asymptotically tight up to logarithmic factors.
This is simply because standard group testing, which corresponds
to the case $p = 1$, requires $M = \Omega(K^2 \log_K N)$ non-adaptive
measurements in the universal setting (cf. [17, Ch. 7])
and $M = \Omega(K \log(N/K))$ measurements in the per-instance
scenario (by a "counting argument").

Fig. 2. Collective sampling using agents in viral epidemics. (•) symbols represent healthy people while (0) symbols indicate infected ones. The dashed lines connect the individuals contacted by the agents. An agent may remain healthy despite having contact with some infected people.

The above-mentioned probabilistic sampling procedure can 
well model the sampling process in an epidemiology appli- 
cation, where the goal is to successfully identify a sparse 
set of virally-infected people in a large population with a 
few collective tests. In viral epidemics, one way to acquire 
collective samples is by sending "agents" inside the population 
whose task is to contact people. Once an agent makes contact 
with an "infected" person, there is a chance that he gets 
infected, too. By the end of the testing procedure, all agents 
are gathered and tested for the disease. Note that, when an 
agent contacts an infected person, he will get infected with a 
certain probability typically smaller than one. Hence, it may 
well happen that an agent's result is negative (meaning that he 
is not infected) despite a contact with some infected person. 
One can assume that each agent has a log file by which one 
can figure out with whom he has made contact. One way to 
implement the log in practice is to use identifiable devices 
(for instance, cell phones) that can exchange unique identifiers 
when in range. This way, one can for instance ask an agent to 
randomly meet a certain number of people in the population 
and at the end, learn which individuals have been met from 
the data gathered by the device. However, one should assume
that when an agent gets infected, the resulting infection will
not be contagious, i.e., an agent never infects other people.¹
In the above model, the agents can in fact take many forms, 
including people who happen to be in contact with random 
individuals within the population (e.g., cashiers, bus drivers, 
etc.). 

¹ This assumption is reasonable with certain diseases when there is an incubation time.

The model explained using the epidemiology example above
can in fact capture a broader range of settings, and in particular,
any group testing problem where items can be defective
with a certain probability. An example of such a setting is
testing for faulty components (or modules) in digital logic 
systems (e.g., an integrated circuit). This can be modeled 
through a probabilistic setting where the probability p denotes 
the percentage of time that a faulty component does not work 
correctly. In this way, one can use our group testing results 
to efficiently localize the few unreliable circuitry elements out 
of a large set of components on a given chip. One should 
note that this model generalizes the classical application of 
group testing for fault detection in electronic circuits where 
components are assumed to be either fully reliable or fully 
unreliable (see [20]).

The organization of this paper is as follows. We first give
an overview of the related work in Section II, which is then
followed in Section III by a more precise formulation of our
problem. In order to solve the original stochastic problem, we
first solve an adversarial variation of it in Section IV, which
we find more convenient to work with. Then, in Section V,
by using the results obtained from the adversarial setting, we
design sensing and recovery procedures to efficiently solve
the original stochastic problem. In Section VI, we provide
a systematic design procedure which gives the
exact value for the number of tests, along with the other
necessary parameters, as a function of the desired probability
of unsuccessful reconstruction. We evaluate the design by
a set of numerical experiments. The paper is summarized
in Section VII.

II. Related Work 

A large body of work in classical group testing has focused 
on combinatorial design, i.e., construction of contact matrices 
that satisfy a disjunctness property (the exact definition will
be provided in Section IV). Matrices that have this property
are of significant interest since they ensure identifiability of
defective items and moreover, they lead to efficient recovery
algorithms. This property has been extensively studied in [21]-[25].
By using probabilistic methods, the authors in [21] developed
upper and lower bounds on the number of tests/rows for the
contact matrix to be $K$-disjunct. More precisely, they showed
that the number of rows should at least scale asymptotically
as $\Omega(K^2 \log N / \log K)$ for exact reconstruction with worst
case input. On the other hand, a randomly generated matrix
will be $K$-disjunct with high probability if the number of rows
scales as $O(K^2 \log(N/K))$ [22]. Having a $K$-disjunct matrix,
one can devise an efficient reconstruction algorithm to identify 
up to K defective items. This is true if the reconstruction 
algorithm fully knows the sampling procedure. However, in 
our scenario, the decoder has to cope simultaneously with two 
sources of uncertainty, the unknown group of defective items 
and the partially unknown (or stochastic) sampling procedure. 
For this reason we need to use a more general form of 
disjunctness property. 

We should also point out the relationship between our setup 
and the compressed sensing (CS) framework [26], [27]. In CS,
a random projection of a sparse signal is given and the goal is 
to find the position as well as the value of the non-zero entries 
of the input signal while keeping the number of projections to 



a minimum. Exploiting the similarity between group testing 
and CS, new recovery algorithms for sparse signal recovery 
have been proposed in [28] and [29]. Although in CS the
goal is to measure and reconstruct sparse signals with few 
measurements, it differs in significant ways from our setup. 
In CS, it is typically assumed that the decoder knows the 
measurement matrix a priori.² However, this is not the case in
our setup. In other words, by using the language of compressed 
sensing, the measurement matrix might be "noisy" and not 
precisely known to the decoder. As it turns out, by using a 
sufficient number of tests this issue can be resolved. Another 
difference is that in CS, the input and the measurement matrix 
are real valued and operations are performed on real numbers 
whereas in our case, the input vector, the measurement matrix 
and the arithmetic are all boolean. 

Recently, the authors in [19] investigated the probabilistic 
testing model that we consider in this paper from an informa- 
tion theoretic perspective. Unlike our combinatorial approach, 
they use information theoretic techniques to obtain bounds 
on the required number of tests. Namely, they get
$M = O(K^2 (\log N)/p^2)$ and $M = O(K (\log N)/p^2)$ measurements
for universal and per-instance scenarios, respectively, which
is asymptotically comparable to what we obtain in this work
(for a fixed $p$ and $K \ll N$). To achieve the bounds, they
consider typical set decoding as the reconstruction method. 
However, they do not provide a practical, low complexity 
decoding algorithm for the reconstruction. 

Another work that is relevant to ours is [35], which considers
group testing under a "dilution" effect. This model is targeted
at biological experiments where combining items in a group
may cause defective items to go undetected when the size of the
group is large. In particular, their model assumes that each item
is independently defective with a certain (fixed) probability,
and a defective item in a group of size $t$ affects the test with
a probability proportional to $1/t$ (thus, a "diluted" group with
few defectives becomes more likely to test negative as its size
grows). They analyze the number of required tests using a
simple (but sub-optimal) test design originally proposed by
Dorfman [3].

III. Problem Definition 

To model the problem, we enumerate the population from 
1 to $N$ and the tests from 1 to $M$. Let the nonzero entries
of $x := (x_1, x_2, \ldots, x_N) \in \mathbb{F}_2^N$ indicate the defective items
within the population, where $\mathbb{F}_2$ is the finite field of order 2.
Moreover, we assume that $x$ is a $K$-sparse vector, i.e., it has
at most $K$ entries equal to one (corresponding to the defective
items). We refer to the support set of $x$, denoted by $\mathrm{supp}(x)$,
as the set which contains positions of the nonzero entries.

² There are, however, works that consider compressed sensing under small perturbations of the measurement matrix (cf. [30]). A large body of the compressed sensing literature considers a noise model where the measurement outcomes are perturbed by a real-valued noise vector, while the measurement matrix is exact. See, for example, [31]-[34] and the references therein.

As is typical in the literature of group testing, we introduce
an $M \times N$ boolean contact matrix $M^{(c)}$ to model the set of
non-adaptive tests. We set $M^{(c)}_{ij}$ to one if and only if the $i$th
test contains the $j$th item. The matrix $M^{(c)}$ only shows which
tests contain which items. In particular, it does not indicate
whether the tests eventually get affected by the defective items.
Let us assume that when a test contains a set of defective
items, each of them makes the test positive independently
with probability $p$, which is a fixed parameter that we call
the activation probability. Therefore, the real sampling matrix
$M^{(s)}$ can be thought of as a variation of $M^{(c)}$ in the following
way: Each nonzero entry of $M^{(c)}$ is flipped to 0 independently
with probability $1 - p$. Then, the resulting matrix $M^{(s)}$ is used
just as in classical group testing to produce the test vector $y$:

$$y = M^{(s)} \cdot x,$$

where the arithmetic is boolean, i.e., multiplication with the
logical AND and addition with the logical OR.

The contact matrix $M^{(c)}$, the test vector $y$, the upper bound
on the number of nonzero entries $K$, and the activation
probability $p$ are known to the decoder, whereas the sampling
matrix $M^{(s)}$ (under which the collective samples are taken)
and the input vector $x$ are unknown. The task of the decoder
is to identify the nonzero entries of $x$ (at most $K$ of them)
based on the known parameters.
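
As a minimal sketch of the measurement model described above (an illustration only, not code from the paper), the following Python snippet draws one realization of the sampling matrix from a given contact matrix and computes the boolean test vector. The function name `sample_tests`, the small 2 x 4 contact matrix, and the seed are hypothetical choices made for this example.

```python
import random


def sample_tests(contact, x, p, rng):
    """Draw one realization of the probabilistic tests.

    contact: list of M rows, each a list of N entries in {0, 1}
    x:       list of N entries in {0, 1} marking the defective items
    p:       activation probability
    rng:     a random.Random instance
    Returns (sampling, y): the sampling matrix and the test vector.
    """
    # Zeros stay zeros; each one survives independently with probability p.
    sampling = [
        [1 if entry == 1 and rng.random() < p else 0 for entry in row]
        for row in contact
    ]
    # Boolean arithmetic: AND for multiplication, OR for addition.
    y = [int(any(s and xj for s, xj in zip(row, x))) for row in sampling]
    return sampling, y


# Hypothetical toy instance: 2 tests, 4 items, item with index 2 defective.
contact = [[1, 0, 1, 0],
           [0, 1, 1, 1]]
x = [0, 0, 1, 0]
sampling, y = sample_tests(contact, x, p=0.5, rng=random.Random(0))
print(sampling, y)
```

Only `contact`, `y`, `p` and the sparsity bound would be visible to a decoder; `sampling` and `x` are returned here purely so the realization can be inspected.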

Example. As a toy example, consider a population with 6 
items where only two of them (items 3 and 4) are defective. 
We do a set of three tests, where the first one contains items 
1, 3, 5, the second one contains items 2, 4, 6, and the third one 
contains items 2, 3, 5, 6. Therefore, the contact matrix and the 
input vector have the following form 

$$x = (0 \; 0 \; 1 \; 1 \; 0 \; 0)^T,$$

$$M^{(c)} = \begin{pmatrix} 1 & 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 & 1 & 1 \end{pmatrix}.$$

Let us assume that only the second test result is positive. This
means that the test vector is

$$y = (0 \; 1 \; 0)^T.$$

As we can observe, there are many possibilities for the
sampling matrix, all of the following form:

$$M^{(s)} = \begin{pmatrix} ? & 0 & ? & 0 & ? & 0 \\ 0 & ? & 0 & ? & 0 & ? \\ 0 & ? & ? & 0 & ? & ? \end{pmatrix},$$

where the question marks are 0 with probability $1 - p$ and 1
with probability $p$. It is the decoder's task to figure out which
combinations make sense based on the outcome vector. For 
example, the following matrices and input vectors fit perfectly 



More formally, the goal of our scenario is two-fold: 

1) Designing the contact matrix $M^{(c)}$ so that it allows
unique reconstruction of sparse input $x$ from outcome
$y$ with overwhelming probability $1 - o(1)$ over the
randomness of the sampling matrix $M^{(s)}$.

2) Proposing a recovery algorithm with low computational 
complexity. 

We present a probabilistic approach for designing contact 
matrices suitable for our problem setting, along with a simple 
decoding algorithm for reconstruction. Our approach is to 
first introduce a rather different setting for the problem that 
involves no randomness in the way the defective items become 
active. Namely, in the new setting an adversary can arbitrarily 
decide whether a certain contact with a defective item results 
in a positive test result or not, and the only restriction on 
the adversary is on the total amount of inactive contacts being 
made. The reason for introducing the adversarial problem is its 
combinatorial nature that allows us to use standard tools and 
techniques already developed in combinatorial group testing. 
Fortunately, it turns out that by choosing a carefully-designed 
value for the total amount of inactive contacts based on the 
parameters of the system, solving the adversarial variation is 
sufficient for the original (stochastic) problem. 

Our task is then to design contact matrices suitable for the 
adversarial problem. We give a probabilistic construction of 
the contact matrix in Section V. The probabilistic construction
requires each test to independently contact the items with a 
certain well-chosen probability. This construction ensures that 
the resulting data gathered at the end of the experiment can 
be used for correct identification of the defective items with 
overwhelming probability, provided that the number of tests 
is sufficiently large. In our analysis, we consider two different
design strategies:

• Per-Instance Design: The contact matrix is suitable for 
every arbitrary, but a priori fixed, sparse input vector with 
overwhelming probability. 

• Universal Design: The contact matrix is suitable for all 
sparse input vectors with overwhelming probability. 

Based on the above definitions, the contact matrix constructed 
for the per-instance scenario, once fixed, may fail to distin- 
guish between all pairs of sparse input vectors. On the other 
hand, in the universal design, one can use a single contact
matrix to successfully measure all sparse input vectors with
a very high probability of success. Our results show that
$M = O(K \log(N)/p^3)$ tests are sufficient for successful
recovery in the per-instance scenario while we need
$M = O(K^2 \log(N/K)/p^3)$ tests for the universal design.

Remark: As is customary in the standard group testing
literature, we think of the sparsity $K$ as a parameter that is
noticeably smaller than the population size $N$; for example, one
may take $K = O(N^{1/4})$. Indeed, if $K$ becomes comparable to
$N$, there would be little point in using a group testing scheme
and in practice, for large $K$ it is generally more favorable to
perform trivial tests on the items.

IV. Adversarial Setting 



The problem described in Section III has a stochastic nature, 
i.e., the sampling matrix is obtained from the contact matrix 
through a random process. In this section, we introduce an 
adversarial variation of the problem whose solution leads us 
to the solution for the original stochastic problem. 

In the adversarial setting, the sampling matrix is obtained
from the contact matrix by flipping up to $e$ arbitrary entries
to 0 on the support (i.e., the set of nonzero entries) of each
column of $M^{(c)}$. The goal is to be able to exactly identify
the sparse input vector despite the perturbation of the contact 
matrix and regardless of the choice of the flipped entries. Note 
that the classical group testing problem corresponds to the 
special case e = 0. Thus, the only difference between the 
adversarial problem and the stochastic one is that in the former, 
the flipped entries of the contact matrix are chosen arbitrarily 
(as long as there are not too many flips) while in the latter, 
they are chosen according to a specific random process. 

It turns out that the combinatorial tool required for solving 
the adversarial problem is closely related to the notion of 
disjunct matrices that is well studied in the group testing 
literature [17]. The formal definition is as follows.

Definition 1: A boolean matrix $M$ with $N$ columns
$M_1, \ldots, M_N$ is called $(K, e)$-disjunct if, for every subset $S$
of the columns with $|S| \le K$, and every $i \in [N]$, we have

$$\Big| \mathrm{supp}(M_i) \setminus \Big( \bigcup_{j \in S \setminus \{i\}} \mathrm{supp}(M_j) \Big) \Big| > e,$$

where $\mathrm{supp}(M_i)$ denotes the support set of the column $M_i$
and $\setminus$ is the set difference operator. In words, this operation
counts the number of nonzero positions in the column $M_i$ for
which all other columns with index in the set $S$ have zeros.
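
To make Definition 1 concrete, the sketch below checks the $(K, e)$-disjunct property by brute force. It is only an illustration under the assumption that the matrix is small (the search is exponential in $K$); the function name `is_disjunct` and the example matrices are hypothetical and not part of the paper.

```python
from itertools import combinations


def is_disjunct(matrix, K, e):
    """Brute-force check of the (K, e)-disjunct property of Definition 1."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    supports = [
        {r for r in range(n_rows) if matrix[r][j]} for j in range(n_cols)
    ]
    for i in range(n_cols):
        others = [j for j in range(n_cols) if j != i]
        # Adding columns to S can only shrink the uncovered part of column i,
        # so it suffices to test the largest admissible subsets.
        size = min(K, len(others))
        for S in combinations(others, size):
            covered = set().union(*(supports[j] for j in S))
            if len(supports[i] - covered) <= e:
                return False
    return True


identity3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
doubled = identity3 + identity3  # every column now has weight 2
print(is_disjunct(identity3, 2, 0), is_disjunct(doubled, 2, 1))
```

Repeating the rows of the identity matrix raises the tolerance $e$ by one per copy, which is the trivial (and test-hungry) way of buying robustness that the probabilistic design later improves on.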

Note that the special case of $(K, 0)$-disjunct matrices corresponds
to the classical notion of $K$-disjunct matrices which
is essentially equivalent to strongly selective families and
superimposed codes (see [36]). Moreover, when all columns of
the matrix have the same Hamming weight $t$, a $(K + 1, 2t/3)$-disjunct
matrix turns out to be equivalent to $t$-majority $K$-strongly
selective families that are defined in [37] (where each
row of the matrix defines the characteristic vector of a set in
the family). This notion is known to be useful for construction
of non-adaptive compressed sensing schemes [37].



The following proposition shows the relationship between 
contact matrices suitable for the adversarial problem and 
disjunct matrices. 

Proposition 2: Let $M$ be a $(K, e)$-disjunct matrix. Then
taking $M$ as the contact matrix solves the adversarial problem
for $K$-sparse vectors with error parameter $e$. Conversely, any
matrix that solves the adversarial problem must be $(K - 1, e)$-disjunct.

Proof: Let $M$ be a $(K, e)$-disjunct matrix and consider
$K$-sparse vectors $x$ and $x'$ supported on different subsets $S$
and $S'$, respectively. Take an element $i \in S'$ which is not in
$S$ (swapping the roles of $x$ and $x'$ if necessary). By Definition 1,
we know that the column $M_i$ has more
than $e$ entries on its support that are not present in the support
of any $M_j$, $j \in S$. Therefore, even after $e$ bit flips in $M_i$, at
least one entry in its support remains that is not present in the
test outcome of $x$, and this makes $x$ and $x'$ distinguishable.

For the reverse direction, suppose that $M$ is not $(K - 1, e)$-disjunct
and take any $i$ and a subset $S$ with $|S| \le K - 1$, $i \notin S$,
which demonstrates a counterexample for $M$ being $(K - 1, e)$-disjunct.
Consider $K$-sparse vectors $x$ and $x'$ supported on $S$
and $S \cup \{i\}$, respectively. An adversary can flip up to $e$ bits on
the support of $M_i$ from 1 to 0, leave the rest of $M$ unchanged,
and ensure that the test outcomes for $x$ and $x'$ coincide. Thus
$M$ is not suitable for the adversarial problem. ■

A. Distance Decoder 

Proposition 2 shows that a $(K, e)$-disjunct contact matrix
can combinatorially distinguish between $K$-sparse vectors in
the adversarial setting with error parameter $e$. In the following,
we show that there exists a much simpler decoder for this
purpose.

Distance decoder: For any column $c_i$ of the contact matrix
$M^{(c)}$, the decoder verifies the following:

$$|\mathrm{supp}(c_i) \setminus \mathrm{supp}(y)| \le e, \tag{1}$$

where $y$ is the vector consisting of the test outcomes. The
coordinate $x_i$ is decided to be nonzero if and only if the
inequality holds.
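
The rule in (1) translates almost directly into code. The following is a minimal sketch, assuming the contact matrix is given as a list of rows; the name `distance_decode` and the example instance (two stacked copies of a 3 x 3 identity matrix) are illustrative assumptions. Each column is scanned once against the $M$ test outcomes, which matches the $O(MN)$ complexity stated earlier.

```python
def distance_decode(contact, y, e):
    """Return the indices j whose column satisfies inequality (1).

    contact: list of M rows, each a list of N entries in {0, 1}
    y:       list of M test outcomes in {0, 1}
    e:       error parameter (maximum tolerated number of missing positives)
    """
    positives = {i for i, outcome in enumerate(y) if outcome}
    n_cols = len(contact[0])
    decoded = []
    for j in range(n_cols):
        support = {i for i, row in enumerate(contact) if row[j]}
        # Count tests that contain item j but came back negative.
        if len(support - positives) <= e:
            decoded.append(j)
    return decoded


# Hypothetical instance: item 0 is defective and one of its two contacts
# was flipped to 0, so only the first test is positive.
contact = [[1, 0, 0], [0, 1, 0], [0, 0, 1],
           [1, 0, 0], [0, 1, 0], [0, 0, 1]]
y = [1, 0, 0, 0, 0, 0]
print(distance_decode(contact, y, e=1))
```

With `e=0` the same function reduces to the standard decoder of classical group testing, which keeps only the columns whose supports are fully contained in the positive tests.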

The above decoder is a straightforward generalization of a 
standard decoder that is used in classical group testing. The 
standard decoder chooses those columns of the measurement 
matrix whose supports are fully contained in the measurement 
outcomes (see [17, Ch. 7]).

Proposition 3: Let the contact matrix $M^{(c)}$ be $(K, e)$-disjunct.
Then, the distance decoder correctly identifies the
support of any $K$-sparse vector in the adversarial
setting with error parameter $e$.

Proof: Let $x$ be a $K$-sparse vector and $S := \mathrm{supp}(x)$,
$|S| \le K$, and let $M^{(c)}_S$ denote the set of columns of the
contact matrix indexed by $S$. Obviously, all the columns in $M^{(c)}_S$
satisfy (1) (as no column is perturbed in more than $e$ positions)
and thus the reconstruction includes the support of $x$ (this is
true regardless of the disjunctness property of $M^{(c)}$). Now
let the vector $\hat{y}$ be the bitwise OR of the columns in $M^{(c)}_S$
and assume that there is a column $c$ of $M^{(c)}$ outside $S$ that
satisfies (1). Thus, since

$$\mathrm{supp}(y) \subseteq \mathrm{supp}(\hat{y}),$$

we will have

$$|\mathrm{supp}(c) \setminus \mathrm{supp}(\hat{y})| \le e,$$

which violates the assumption that $M^{(c)}$ is $(K, e)$-disjunct for
the support set $S$ and the column $c$ outside this set. Therefore,
the distance decoder outputs the exact support of $x$. ■
Of course, posing the adversarial problem is interesting if 
it helps in solving the original stochastic problem from which 
it originates. In the next section, we show that this is indeed 
the case; and in fact the task of solving the stochastic problem 
reduces to that of the adversarial problem. 

V. Probabilistic Design 

In this section, we consider a probabilistic construction for
$M^{(c)}$, where each entry of $M^{(c)}$ is set to 1 independently with
probability $q = \alpha/K$, for a parameter $\alpha$ to be determined later,
and to 0 with probability $1 - q$. We will use standard arguments to
show that, if the number of tests $M$ is sufficiently large, then
the resulting matrix $M^{(c)}$ is suitable with all but a vanishing
probability.
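
A sketch of this construction is given below, assuming plain Python lists as the matrix representation; the function name `random_contact_matrix` and the parameter values in the example are hypothetical and chosen only for illustration.

```python
import random


def random_contact_matrix(M, N, K, alpha, rng):
    """Sample an M x N contact matrix with i.i.d. Bernoulli(alpha / K) entries."""
    q = alpha / K
    return [[1 if rng.random() < q else 0 for _ in range(N)] for _ in range(M)]


# Illustrative values only: here q = 0.5 / 5 = 0.1.
contact = random_contact_matrix(M=200, N=50, K=5, alpha=0.5, rng=random.Random(0))
density = sum(map(sum, contact)) / (200 * 50)
print(density)  # empirical fraction of ones, to be compared with q
```

Each test therefore contacts about $qN = \alpha N / K$ items on average, so a test contains roughly $\alpha$ defective items when exactly $K$ items are defective.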

By looking carefully at the proof of Proposition 3, we see
that there are two events that may prevent the distance decoder
with error parameter $e$ from successfully recovering the input vector
$x$ with support $S$:

1) There are more than $e$ flips on some column of the contact
matrix in $S$.

2) There exists a column outside $S$ where the $(K, e)$-disjunct
property is violated.

Based on these observations, the number of tests required
for building suitable contact matrices is given by the following
theorem.

Theorem 4: Consider $M \times N$ contact matrices constructed
by the probabilistic design procedure. If
$M = O(K \log(N)/p^3)$ for the per-instance scenario or
$M = O(K^2 \log(N/K)/p^3)$ for the universal scenario, then the
probability of failure for the reconstruction with the distance
decoder goes to zero as $N \to \infty$.

Proof: Let $e$ be the decision-making parameter of the
distance decoder. We first find an upper bound for the number
of bit flips in any column of the contact matrix $M^{(c)}$. To
this end, take any column $m_i$ of $M^{(c)}$. Each entry of the
column is flipped independently with probability $(1 - p)q$
which, on average, results in $(1 - p)qM$ bit flips per column.
Let $e = (1 + \delta)(1 - p)qM$ for a constant $\delta > 0$. By Chernoff
bounds (cf. [38]), the probability that the amount of bit flips
exceeds $e$ is at most

$$\exp\left(-\delta^2 (1 - p) q M / (2 + \delta)\right). \tag{2}$$
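
For a feel of how quickly the bound in (2) decays, the short sketch below evaluates the threshold $e$ and the right-hand side of (2) for given parameters. The helper names `flip_threshold` and `flip_tail_bound` and the numerical values are assumptions made for illustration.

```python
import math


def flip_threshold(p, q, M, delta):
    """Threshold e = (1 + delta)(1 - p) q M used by the distance decoder."""
    return (1 + delta) * (1 - p) * q * M


def flip_tail_bound(p, q, M, delta):
    """Right-hand side of (2): bound on P[number of flips in a column > e]."""
    return math.exp(-delta ** 2 * (1 - p) * q * M / (2 + delta))


# Illustrative values: p = 1/2, q = 1/64, delta = 1/4.
for M in (1024, 8192, 65536):
    print(M, flip_threshold(0.5, 1 / 64, M, 0.25),
          flip_tail_bound(0.5, 1 / 64, M, 0.25))
```

The exponent grows only linearly in $qM$ with a small constant $\delta^2/(2 + \delta)$, which is why the bound needs a fairly large number of tests before it becomes small.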



Second, we check the disjunctness property of the contact
matrix for this parameter $e$. To this end, consider any set $S$ of
$K$ columns of $M^{(c)}$, and any column outside these, say the $i$th
column where $i \notin S$. First we upper bound the probability of
failure for this choice of $S$ and $i$, that is, the probability that
the number of rows that have a 1 at the $i$th column and all
zeros at the positions corresponding to $S$ is at most $e$. Clearly,
if this event happens, the $(K, e)$-disjunct property is violated.

A row is good if at that row the $i$th column has a 1 but all the
columns in $S$ have zeros. For a particular row, the probability
that the row is good is $q(1 - q)^K$ (using independence of the
entries of the measurement matrix). Then failure corresponds
to the event that the number of good rows is at most $e$. The
distribution of the number of good rows is binomial with mean
$\mu = q(1 - q)^K M$. Using the Chernoff bound and assuming
that $e < \mu$ (we will choose $\alpha$ and $\delta$ to ensure this condition
is satisfied), we have³



failure probability < exp 



2// 



exp 



-Mq((l-q) K -(l-p)(l + S)Y 



2(1 -q) 



K 



(3) 



( <' czzp ( -Mq( (l-2a)-(l-p)(l + 6)Y 



< exp 
= exp(— M^f/K), 



2(1 -a/2) j 
^Mq((l-2a)-(l-p)(l + S)) 2 



(4) 



where we have defined 

a 
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((l-2a)-(l-p)(l + 5)Y 



The inequality (a) is due to the fact that $(1-q)^K = (1-\alpha/K)^K$ is always between $3^{-\alpha}$ and $2^{-\alpha}$, and in particular for $\alpha \in [0, 1]$, this range is strictly contained in $[1-2\alpha, 1-\alpha/2)$. Note that by choosing the parameters $\alpha$ and $\delta$ sufficiently small, the quantity

$$\left((1-2\alpha) - (1-p)(1+\delta)\right)^2$$

in the exponent can be made arbitrarily close to $p^2$. As a concrete choice, however, we take $\delta := p/2$ and $\alpha := p/8$, which gives $\gamma = \Omega(p^3)$ and, therefore,

$$\text{failure probability} \le 2^{-\Omega(Mp^3/K)}. \tag{5}$$
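For concreteness, the arithmetic behind the choice $\delta = p/2$, $\alpha = p/8$ can be spelled out as follows. This is a worked check of the step above rather than part of the original derivation, and the explicit constant $1/128$ is not claimed by the source.

```latex
\begin{align*}
(1-2\alpha) - (1-p)(1+\delta)
  &= \left(1 - \tfrac{p}{4}\right) - (1-p)\left(1 + \tfrac{p}{2}\right) \\
  &= \left(1 - \tfrac{p}{4}\right) - \left(1 - \tfrac{p}{2} - \tfrac{p^2}{2}\right) \\
  &= \tfrac{p}{4} + \tfrac{p^2}{2} \;\ge\; \tfrac{p}{4}, \\
\alpha \left((1-2\alpha) - (1-p)(1+\delta)\right)^2
  &\ge \tfrac{p}{8} \cdot \tfrac{p^2}{16} = \tfrac{p^3}{128}.
\end{align*}
```

Since $q = \alpha/K$, the exponent is therefore of order $Mp^3/K$. The bracket being strictly positive also confirms that $e < \mu$ holds for this choice, because $(1-q)^K \ge 1 - 2\alpha$.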



In order to calculate the number of tests, we consider per- 
instance and universal scenarios separately. 
• Per-instance Scenario 

For the per-instance scenario, the disjunctness property 
needs to hold only for a fixed set S, corresponding to 
the support of the fixed sparse vector that defines the 
instance. Therefore, we only need to apply the union 
bound over all possible choices of i for a fixed set S. 
From (5), the probability of coming up with a bad choice of $M^{(c)}$ would thus be at most

$$N\, 2^{-\Omega(Mp^3/K)}.$$

This probability vanishes for an appropriate choice of

$$M = \Omega\left(\frac{K \log N}{p^3}\right). \tag{6}$$

(Footnote to the assumption $e < \mu$ used in the Chernoff bound above: the failure probability is at least 0.5 if $e > \mu$.)

At the same time, using (2) and the union bound, the probability that the number of bit flips in any of the $K$ columns in $S$ exceeds $e$ is upper bounded by

$$K \exp\left(-\delta^2(1-p)\,qM/(2+\delta)\right) = K \exp\left(-\Omega\left(\frac{\alpha\delta^2(1-p)\log(N)}{(2+\delta)\,p^3}\right)\right),$$

which is vanishing (i.e., $o(1)$) assuming the constant behind the $\Omega(\cdot)$ notation in (6) is sufficiently large. Therefore, the distance decoder successfully decodes the input vector $x$ with probability $1 - o(1)$ in the per-instance scenario with $M = O(K\log(N)/p^3)$ tests.
• Universal Scenario

In this case, we apply the union bound over all possible choices of $S$ and $i$. Using (4), the probability of coming up with a bad choice of $M^{(c)}$ is at most $N\binom{N}{K}\exp(-M\gamma/K)$. This probability vanishes for an appropriate choice of

$$M = \Theta\left(\frac{K^2\log(N/K)}{\gamma}\right) = \Theta\left(\frac{K^2\log(N/K)}{p^3}\right).$$

At the same time, using (2) and the union bound, the probability that the number of bit flips in any of the $N$ columns of the contact matrix exceeds $e$ is upper bounded by

$$N \exp\left(-\delta^2(1-p)\,qM/(2+\delta)\right) = N \exp\left(-\Theta\left(\frac{\alpha\delta^2(1-p)K\log(N/K)}{(2+\delta)\,\gamma}\right)\right) = o(1).$$

Therefore, with $M = O(K^2\log(N/K)/p^3)$ tests, the probabilistic design constructs a contact matrix such that the distance decoder is able to decode all sparse input vectors $x$ with probability $1 - o(1)$.

■ 
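The counting argument in the proof above can be illustrated numerically. The following Python sketch, under assumed parameter values, evaluates the mean number of good rows $\mu$, the threshold $e$ and the bound (3), and compares $\mu$ with a simulated count of good rows for one fixed pair $(S, i)$. The function names and all numerical values are hypothetical and serve only as an illustration.

```python
import math
import random

def good_row_stats(M, K, q, p, delta):
    """Return (mu, e, bound): mean number of good rows, threshold e and bound (3)."""
    mu = q * (1.0 - q) ** K * M
    e = (1.0 + delta) * (1.0 - p) * q * M
    if e >= mu:
        raise ValueError("bound (3) requires e < mu")
    bound = math.exp(-((mu - e) ** 2) / (2.0 * mu))
    return mu, e, bound

def count_good_rows(M, K, q, rng):
    """Sample M rows restricted to column i and the K columns of S; count good rows."""
    good = 0
    for _ in range(M):
        column_i = rng.random() < q
        # Draw all K entries of S so that every row uses the same number of samples.
        columns_S = [rng.random() < q for _ in range(K)]
        if column_i and not any(columns_S):
            good += 1
    return good

# Hypothetical values: q = alpha / K with alpha = 0.5 and K = 10.
M, K, q, p, delta = 20000, 10, 0.05, 0.8, 0.5
mu, e, bound = good_row_stats(M, K, q, p, delta)
good = count_good_rows(M, K, q, random.Random(1))
print(f"mu = {mu:.1f}, e = {e:.1f}, simulated good rows = {good}, bound (3) = {bound:.3e}")
```

With these values the threshold $e$ sits well below $\mu$, so a violation of the disjunctness condition for this particular pair is extremely unlikely; the union bounds in the two scenarios then account for all relevant choices of $S$ and $i$.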

The probabilistic construction results in a rather sparse contact matrix, namely, one with density $O(1/K)$ that decays with the sparsity parameter $K$. In the following, we show that sparsity is necessary for the probabilistic construction to work.

Proposition 5: Let $\mathsf{M}$ be an $M \times N$ boolean random matrix, where $M = O(K^2\log(N/K))$ or $M = O(K\log(N))$ for an integer $K > 0$, which is constructed by setting each entry independently to 1 with probability $q$. Then either $q = O(\log K/K)$ or otherwise the probability that $\mathsf{M}$ is $(K, e)$-disjunct (for any $e \ge 0$) approaches zero as $N$ grows.

Proof: Suppose that $\mathsf{M}$ is an $M \times N$ matrix that is $(K, e)$-disjunct. Observe that, for any integer $t \in (0, K)$, if we remove any $t$ columns of $\mathsf{M}$ and all the rows on the support of those columns, the matrix must remain $(K-t, e)$-disjunct. This is because any counterexample for the modified matrix being $(K-t, e)$-disjunct can be extended to a counterexample for $\mathsf{M}$ being $(K, e)$-disjunct by adding back the removed columns and rows.



Now consider any $t$ columns of $\mathsf{M}$, and denote by $M_0$ the number of rows of $\mathsf{M}$ at which the entries corresponding to the chosen columns are all zeros. The expected value of $M_0$ is $(1-q)^t M$. Moreover, for every $\delta \in (0, 1)$ we have

$$\Pr\left[M_0 > (1+\delta)(1-q)^t M\right] \le \exp\left(-\frac{\delta^2 (1-q)^t M}{3}\right) \tag{7}$$

by the Chernoff bound. Let $t_0$ be the largest integer for which

$$(1+\delta)(1-q)^{t_0} M \ge \log N.$$

If $t_0 < K-1$, we let $t := t_0 + 1$ above, and this makes the right hand side of (7) upper bounded by $o(1)$. So with probability $1 - o(1)$, the chosen $t$ columns of $\mathsf{M}$ will keep $M_0$ at most $(1+\delta)(1-q)^t M$. Removing those columns and all the rows on the support of these columns leaves the matrix $(K-t, e)$-disjunct, which obviously requires at least $\log N$ rows as even a $(1, 0)$-disjunct matrix needs so many rows. Therefore, we must have

$$(1+\delta)(1-q)^t M \ge \log N.$$

However, this inequality is not satisfied by the assumption on $t_0$. So if $t_0 < K-1$, little chance remains for $\mathsf{M}$ to be $(K, e)$-disjunct for any $e \ge 0$. Therefore, we should have $t_0 \ge K-1$. Using the condition on $t_0$, we have

$$(1+\delta)(1-q)^{K-1} M \ge \log N.$$



This is equivalent to

$$\log\frac{1}{1-q} \le \frac{\log\left(M(1+\delta)/\log N\right)}{K-1},$$

which, since $q \le \log\frac{1}{1-q}$, for $M = O(K^2\log(N/K))$ or $M = O(K\log(N))$ gives $q = O(\log K/K)$. ■
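To get a feeling for the density constraint obtained at the end of the proof, the following Python sketch solves the inequality $(1+\delta)(1-q)^{K-1}M \ge \log N$ for the largest admissible $q$. The helper name `max_density` is hypothetical; the values of $N$, $K$, $M$ and $\alpha$ are taken from the numerical experiment in Section VI, while $\delta = 0.5$ is an arbitrary illustrative choice.

```python
import math

def max_density(M, N, K, delta):
    """Largest q satisfying (1 + delta) * (1 - q)**(K - 1) * M >= log N."""
    if K < 2:
        raise ValueError("the bound is stated for K >= 2")
    ratio = math.log(N) / ((1.0 + delta) * M)
    if ratio >= 1.0:
        return 0.0  # too few rows: the inequality can hold for no q > 0
    return 1.0 - ratio ** (1.0 / (K - 1))

# Values from the per-instance experiment (N, K, M, alpha); delta is illustrative.
N, K, M, alpha, delta = 100000, 10, 3000, 0.44, 0.5
q_max = max_density(M, N, K, delta)
q_design = alpha / K
print(f"q_max = {q_max:.3f}, design density alpha/K = {q_design:.3f}")
```

Under these assumptions the design density $\alpha/K$ lies comfortably below the computed limit, in line with the requirement $q = O(\log K/K)$.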
In summary, our results indicate that for both the per-instance and universal settings, the activation probability $p$ increases the upper bound on the number of tests by a factor of $1/p^3$. Moreover, we can use the simple distance decoder to recover the unknown input vector with complexity $O(MN)$. However, in order for the probabilistic design to work, we should choose a flip probability $q$ such that $q = O(\log K/K)$. In fact, our choice of $q = \alpha/K$ for a constant $\alpha$ satisfies this requirement.

VI. System Design and Simulation Results 

In this section, we provide a systematic design procedure which gives the number of tests necessary for the decoding process to be successful. While the design procedure applies to both per-instance and universal scenarios, numerical simulation results are provided only for the per-instance setting, since evaluating the universal design requires testing all possible inputs, which is computationally prohibitive.

According to the discussion in Section V, there are two types of failure events which we want to avoid in designing the contact matrix $M^{(c)}$. The first failure event, denoted as $f_1$, happens when the number of bit flips in a column is not tolerable by the contact matrix, and the second event, denoted as $f_2$, relates to the violation of the disjunctness property of the matrix. The inputs to the design procedure are $N$, $K$, $p$, $p_{f_1}$ and $p_{f_2}$, where the last two parameters denote the maximum tolerable probability for the first and second failure events, respectively. Then, the design procedure should provide us with the quantities $M$, $q$ and $e$, which are the required parameters to set up the sensing and recovery algorithms.

Let us summarize the results of the probabilistic design of Section V. First we define $\eta$ from (3) as

$$\eta := \frac{q\left((1-q)^K - (1-p)(1+\delta)\right)^2}{2(1-q)^K}. \tag{8}$$

Then, for the per-instance scenario (which is denoted by (i)), we have

$$p_{f_1}^{(i)} \le 1 - \left(1 - \exp\left(-(1-p)\,qM^{(i)}\left(\log(1+\delta)^{(1+\delta)} - \delta\right)\right)\right)^K,$$

$$p_{f_2}^{(i)} \le N\exp\left(-M^{(i)}\eta\right),$$

$$e^{(i)} = (1+\delta)(1-p)\,qM^{(i)}.$$

Fig. 5. The number of tests as a function of the failure probability for the per-instance scenario. The parameters are N = 100'000, K = 10 and activation probability p = 0.8.

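For the universal strategy (which is denoted by (u)), we have

$$p_{f_1}^{(u)} \le 1 - \left(1 - \exp\left(-(1-p)\,qM^{(u)}\left(\log(1+\delta)^{(1+\delta)} - \delta\right)\right)\right)^N,$$

$$p_{f_2}^{(u)} \le N\binom{N}{K}\exp\left(-M^{(u)}\eta\right),$$

$$e^{(u)} = (1+\delta)(1-p)\,qM^{(u)}.$$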

Note that since the first failure event happens independently on the columns, we have used a more precise expression for this failure probability which does not use the union bound, and also makes use of the exact expression for the Chernoff bound.

Let us provide the details of the design for the per-instance scenario; the universal design follows the same lines. For any fixed value of $\alpha$, $\delta > 0$ is the only parameter which should be determined such that the failure probabilities fall below the maximum tolerable values. To this end, we initialize the value of $\delta$ to zero and increase it in small steps up to the value $\delta_{\max}$. Given that $e = (1-p)(1+\delta)\,qM^{(i)}$ and $\mu = q(1-q)^K M^{(i)}$, and under the condition that $e < \mu$, we have

$$\delta_{\max} = (1-q)^K/(1-p) - 1.$$

For any value of $\delta < \delta_{\max}$ and given the maximum tolerable probability for the second failure event $p_{f_2}$, the number of tests is computed as

$$M^{(i)} = \eta^{-1}\log(N/p_{f_2}),$$

where $\eta$ is defined in (8). This is then used to compute the corresponding probability for the first failure event as

$$p_{f_1}^{(i)} = 1 - \left(1 - \exp\left(-(1-p)\,qM^{(i)}\left(\log(1+\delta)^{(1+\delta)} - \delta\right)\right)\right)^K.$$

We continue increasing $\delta$ until $p_{f_1}^{(i)}$ falls below the maximum tolerable probability for the first failure event $p_{f_1}$. This provides us with the number of tests $M^{(i)}$ and the error parameter $e^{(i)}$ for the chosen value of $\alpha$. This whole process is continued for different values of $\alpha$ in the range $[0, \alpha_{\max}]$. At the end, we find the value of $\alpha$ which results in the minimum number of tests for the given value of $p$. This provides us with the number of tests and the decision parameter for the distance decoder.
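The per-instance procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the grid resolutions `alpha_steps` and `delta_step`, the default $\alpha_{\max} = 1$, and the inclusion of the factor $q$ in $\eta$ (so that the second failure probability is at most $N\exp(-M\eta)$) are choices made here for the example.

```python
import math

def eta(q, K, p, delta):
    """Rate eta, chosen so that the disjunctness failure is at most N * exp(-M * eta)."""
    a = (1.0 - q) ** K
    gap = a - (1.0 - p) * (1.0 + delta)
    return q * gap * gap / (2.0 * a)

def p_f1(q, K, p, delta, M):
    """Per-instance bound on the probability of too many bit flips in some column of S."""
    mean_flips = (1.0 - p) * q * M
    rate = (1.0 + delta) * math.log(1.0 + delta) - delta
    return 1.0 - (1.0 - math.exp(-mean_flips * rate)) ** K

def design_per_instance(N, K, p, pf1_max, pf2_max,
                        alpha_max=1.0, alpha_steps=50, delta_step=0.01):
    """Sweep alpha and delta; return the admissible design with the fewest tests.

    Assumes 0 < p < 1. The grid parameters are illustrative defaults.
    """
    best = None
    for j in range(1, alpha_steps + 1):
        alpha = alpha_max * j / alpha_steps
        q = alpha / K
        delta_max = (1.0 - q) ** K / (1.0 - p) - 1.0
        delta = 0.0
        while delta < delta_max:
            h = eta(q, K, p, delta)
            if h < 1e-12:
                break  # numerically at the boundary e = mu
            M = math.ceil(math.log(N / pf2_max) / h)
            if p_f1(q, K, p, delta, M) <= pf1_max:
                if best is None or M < best["M"]:
                    best = {"M": M, "alpha": alpha, "q": q, "delta": delta,
                            "e": (1.0 + delta) * (1.0 - p) * q * M,
                            "mu": q * (1.0 - q) ** K * M}
                break  # smallest admissible delta for this alpha
            delta += delta_step
    return best

strict = design_per_instance(100000, 10, 0.8, 0.001, 0.001)
loose = design_per_instance(100000, 10, 0.8, 0.5, 0.5)
print(strict["M"], loose["M"])
```

The search stops at the first admissible $\delta$ for each $\alpha$ because, in this formulation, the number of tests grows with $\delta$; the returned dictionary also records $e$ and $\mu$ so that the condition $e < \mu$ can be checked directly.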



Figures 3(a) and 3(b) show the number of tests, for universal and per-instance strategies, as a function of the parameter $\alpha$ and the activation probability $p$. The population size is N = 100'000, the number of defective items is K = 10 and the maximum tolerable probabilities for the two failure events are set to $p_{f_1} = p_{f_2} = 0.001$. Note that the number of tests for the per-instance scenario is much smaller than for the universal scenario and, moreover, it allows us to have designs appropriate for smaller activation probabilities. The black curve in each figure connects the points with the minimum number of tests for each value of the activation probability $p$, which in turn provides us with the appropriate value for the parameter $\alpha$. The black curves are extracted and shown separately in Figure 3(c). In Figures 4(a), 4(b) and 4(c) we show the output of the design procedure for N = 100'000'000, K = 500 and $p_{f_1} = p_{f_2} = 0.001$.

In Figure 5 we set N = 100'000, K = 10 and p = 0.8 and use the design procedure to plot the number of tests as a function of the probability of failure in the per-instance strategy. Then, in Figure 6 we run a numerical experiment with the same values for the parameters $N$, $K$ and $p$ to assess the performance of the recovery algorithm, with the results averaged over 4000 trials. We set the parameters $e$ and $\alpha$ for the numerical experiment equal to those which give us a probability of failure of 0.5 in Figure 5 (which are $e = 40$ and $\alpha = 0.44$) and change the number of tests. Note that although we expect a failure probability of around 0.5 for $M = 3000$ tests according to Figure 5, the recovery performance is much better in numerical simulations. This can be explained by noting that the upper bounds for the failure probabilities are not tight in general.

Fig. 3. The number of tests (M) as the output of the design procedure for (a) per-instance and (b) universal strategies, as a function of the parameter $\alpha$ and the activation probability $p$. The parameters are set to N = 100'000, K = 10 and $p_{f_1} = p_{f_2} = 0.001$. The black curves provide us with the value of $\alpha$ which gives the minimum number of tests for each value of the activation probability $p$. (c) The minimum number of tests (corresponding to the black curves in (a) and (b)) for universal and per-instance strategies, as a function of the activation probability $p$.

Fig. 6. The simulated probability of exact recovery for the per-instance strategy for N = 100'000, K = 10 and p = 0.8, averaged over 4000 trials. Although the design procedure expects a high probability of failure for M = 3000 tests (see Figure 5), the recovery performance is much better in numerical simulations.
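A numerical experiment of this kind can be sketched in Python as follows. The decoding rule used here is an assumption made for the example, since the distance decoder is defined outside this section: item $i$ is declared defective when the number of rows in which column $i$ has a 1 but the test outcome is negative is at most $e$. The instance is deliberately much smaller than the one reported above (hypothetical values N = 200, K = 3), and all function names are illustrative.

```python
import random

def make_contact_matrix(M, N, q, rng):
    """M x N boolean matrix with i.i.d. Bernoulli(q) entries."""
    return [[rng.random() < q for _ in range(N)] for _ in range(M)]

def run_tests(matrix, defectives, p, rng):
    """A test is positive if at least one pooled defective item is active (prob. p each)."""
    outcomes = []
    for row in matrix:
        active = [rng.random() < p for i in defectives if row[i]]
        outcomes.append(any(active))
    return outcomes

def distance_decode(matrix, outcomes, e):
    """Assumed rule: declare item i defective if at most e of its pools test negative."""
    N = len(matrix[0])
    mismatches = [0] * N
    for row, positive in zip(matrix, outcomes):
        if not positive:
            for i in range(N):
                if row[i]:
                    mismatches[i] += 1
    return {i for i in range(N) if mismatches[i] <= e}

# Hypothetical small instance; alpha and p follow the experiment above, the rest is illustrative.
N, K, p = 200, 3, 0.8
alpha, delta, M = 0.44, 0.7, 1200
q = alpha / K
e = (1.0 + delta) * (1.0 - p) * q * M
rng = random.Random(0)
defectives = sorted(rng.sample(range(N), K))
matrix = make_contact_matrix(M, N, q, rng)
outcomes = run_tests(matrix, defectives, p, rng)
decoded = distance_decode(matrix, outcomes, e)
print("exact recovery:", sorted(decoded) == defectives)
```

The decoder touches each matrix entry at most once, which matches the $O(MN)$ complexity stated for the distance decoder. Repeating the experiment over many random matrices and defective sets, and varying $M$, would give an empirical recovery curve in the spirit of Figure 6.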



VII. Conclusion 



We studied the problem of identifying a small number of defective items among a large population, using collective samples. With the viral epidemic application in mind, we investigated the case where the recovery algorithm may possess only partial knowledge about the sampling process, in the sense that the defective items can become inactive in the test results. We showed that by using a probabilistic model for the sampling process, one can design a non-adaptive contact matrix which leads to the successful identification of the defective items with overwhelming probability $1 - o(1)$. We considered two strategies for the design procedure. In per-instance design, the contact matrix is suitable for each sparse input vector with overwhelming probability, while in universal design, this is true for all sparse inputs. To this end, we proposed a probabilistic design procedure which requires a "small" number of tests to single out the sparse vector of defective items. More precisely, we showed that for an activation probability $p$, the number of tests sufficient for identification of up to $K$ defective items in a population of size $N$ is given by $M = O(K\log(N)/p^3)$ for the per-instance scenario and $M = O(K^2\log(N/K)/p^3)$ for the universal scenario. Moreover, we proposed a simple decoder which is able to successfully identify the defective items with complexity $O(MN)$. Finally, we provided a systematic design procedure which gives the number of tests $M$, along with the design parameters $\alpha$ and $e$, required for successful recovery. As expected, the numerical experiments showed that the number of tests provided by the design procedure overestimates the true one required to achieve a specified probability of failure. As a complement to this work, one can also consider the effects of false positives and false negatives on the required number of tests. We leave this issue for future work.

Fig. 4. The number of tests (M) as the output of the design procedure for (a) per-instance and (b) universal strategies, as a function of the parameter $\alpha$ and the activation probability $p$. The parameters are set to N = 100'000'000, K = 500 and $p_{f_1} = p_{f_2} = 0.001$. The black curves provide us with the value of $\alpha$ which gives the minimum number of tests for each value of the activation probability $p$. (c) The minimum number of tests (corresponding to the black curves in (a) and (b)) for universal and per-instance strategies, as a function of the activation probability $p$.